Reliability Threats in VDSM - Shortcomings in Conventional Test and Fault-Tolerance Alternatives

Author

  • Michael Nicolaidis
Abstract

IC technologies are approaching the ultimate limits of silicon in terms of device size, power supply levels and speed. As they approach these limits, they become increasingly sensitive to noise, which results in unacceptable soft-error rates. Furthermore, defect behavior becomes increasingly complex, resulting in growing numbers of timing and other spurious faults that can escape detection during fabrication testing. This makes it increasingly difficult to achieve acceptable reliability levels for future ICs and to maintain acceptable cost and quality for IC testing. One important reliability threat is related to single-event transients (SETs) and single-event upsets (SEUs). An SEU is the consequence of a transient current pulse (a single-event transient) created when a particle strikes a sensitive node of an integrated circuit. When an SET occurring on a memory cell node flips the state of the cell, it is transformed into an SEU. Similarly, when an SET occurring on a node of a logic network propagates through the gates of the network and is captured by a latch as a logic error, it is transformed into an SEU. Atmospheric neutrons affect the operation of modern ICs even at ground level. A few years ago, the energy of the secondary particles produced by the nuclear reaction of neutrons with the matter of an IC was insufficient to affect its operation. However, as we approached 0.1 µm and began using very low supply voltages, the rates of errors induced by cosmic neutrons became unacceptable. Furthermore, alpha particles produced by the disintegration of unstable isotopes in the IC materials and packaging are another cause of increasing soft-error rates. In addition, in today's technologies, soft errors concern not only memories (which was the case so far) but also logic. One basic reason for the increased sensitivity of logic parts is the reduction of the device size and of the Vdd level.
Since both the Vdd level and the circuit node capacitance Cnode are reduced, the charge stored on a node (Q = Vdd * Cnode) is reduced drastically. Consequently, a significantly lower charge deposited by a particle strike suffices to flip the logic value of a node, creating a transient pulse (single-event transient, or SET), or to flip the state of a storage cell (single-event upset, or SEU). In the past, the probability of a soft error occurring in logic parts was drastically lower than in memories, for two reasons: (i) propagation through logic gates can filter the induced transient pulse, and (ii) a transient pulse propagated through a logic network results in a logic error only if it reaches the input of a latch simultaneously with the latching edge of the clock. For these reasons, traditionally only memories have been protected against SEUs, even in radiation-hostile environments such as space. Unfortunately, deeper submicron scaling drastically increases the sensitivity of logic networks too. In fact, a transient pulse wider than the logic transition time of a gate propagates through the gate without attenuation. Transient pulses induced by particle strikes have a width of a few hundred picoseconds (the exact value depends on the circuit characteristics and the particle’s energy). Since the transition time of logic gates is becoming very short in VDSM, transient pulses are no longer attenuated, even for relatively low-energy particles. In addition, as clock frequencies increase significantly, the probability of latching a transient pulse increases as well: the more frequent the latching edges of the clock, the higher the probability that a transient pulse coincides with a latching edge. Due to these trends, error rates in logic parts become significant. SETs and SEUs are not due to physical defects. The circuit can work perfectly most of the time but produce errors at random instants.
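The charge and latching arguments above can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the function names and the example voltages, capacitances and clock frequencies are hypothetical, and the latching estimate (pulse width divided by clock period) is a simplifying assumption that ignores logical and electrical masking. Only the relation Q = Vdd * Cnode and the pulse width of a few hundred picoseconds come from the text.

```python
def node_charge(vdd_volts: float, c_node_farads: float) -> float:
    """Charge stored on a circuit node, Q = Vdd * Cnode, in coulombs."""
    return vdd_volts * c_node_farads


def latch_probability(pulse_width_s: float, clock_freq_hz: float) -> float:
    """Rough chance that a transient pulse overlaps a latching clock edge.

    Assumption (not from the paper): one latching edge per clock period,
    and the pulse arrives at a uniformly random instant, so the chance is
    about pulse_width / period, capped at 1.
    """
    return min(1.0, pulse_width_s * clock_freq_hz)


# Hypothetical technology points, chosen only to show the trend.
q_old = node_charge(3.3, 10e-15)   # 3.3 V, 10 fF node
q_new = node_charge(1.0, 1e-15)    # 1.0 V, 1 fF node

# A 200 ps transient ("a few hundred picoseconds" in the text).
p_slow = latch_probability(200e-12, 100e6)   # 100 MHz clock
p_fast = latch_probability(200e-12, 1e9)     # 1 GHz clock

print(f"stored charge: {q_old * 1e15:.1f} fC -> {q_new * 1e15:.1f} fC")
print(f"latching chance: {p_slow:.2f} -> {p_fast:.2f}")
```

Under these assumed numbers, the stored charge drops by more than an order of magnitude while the chance of capturing a given transient grows tenfold, which is the trend the abstract describes.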
Manufacturing (one-time) testing therefore cannot cope with SETs and SEUs. As another problem, timing faults are gaining importance in VDSM technologies. Process parameter variations and various defect types (shorts, opens, ...) often affect circuit speed: they increase signal delays and result in timing faults. Such faults may require complex test conditions to be detected, due to the huge number of paths in modern ICs. In addition, some of these faults will be detected only if they are activated in conjunction with other timing-critical conditions (e.g. crosstalk, ground bounce, ...). This makes ATPG for such faults computationally infeasible and the test length unrealistic. It becomes unavoidable that an increasing number of circuits with timing faults will pass fabrication tests. In this context, fault-tolerant IC design for soft errors and timing faults becomes mandatory for various application domains. Many of these domains cannot afford the high cost of fault-tolerant schemes such as TMR. Thus, alternative solutions are required. Fortunately, EDAC codes can protect memories at acceptable cost. For logic, the situation is more complex. However, due to the temporary nature of the targeted faults, concurrent error detection based on time redundancy, together with hardware retry mechanisms, can enable cost-effective protection of logic. Such approaches will gain in importance in the near future. According to the ITRS roadmap (1999 Ed., Design, p. 43), among the "Difficult Challenges in Systems Design (<100nm, beyond 2005)" one can find: "The ability to insert robustness automatically into the design will become a priority as the systems become too large to test functionally at manufacturing exit. The automatic introduction of techniques such as redundant logic for fault tolerance is needed".
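The idea of detection by time redundancy combined with retry can be sketched in software. This is only an analogy under stated assumptions, not the hardware scheme of the paper: `FlakyAdder`, `run_with_time_redundancy` and the bit-flip fault model are hypothetical names and choices made for illustration. The same computation is evaluated at two different times; since a transient fault is unlikely to disturb both evaluations identically, a mismatch signals an error and triggers a retry.

```python
class FlakyAdder:
    """8-bit adder whose result is corrupted on chosen call numbers.

    Hypothetical fault model: on a "faulty" call, bit 2 of the result
    is flipped, standing in for a latched transient pulse.
    """

    def __init__(self, faulty_calls):
        self.faulty_calls = set(faulty_calls)
        self.calls = 0

    def __call__(self, operands):
        a, b = operands
        result = (a + b) & 0xFF
        if self.calls in self.faulty_calls:
            result ^= 0x04  # simulated transient error
        self.calls += 1
        return result


def run_with_time_redundancy(compute, x, max_retries=3):
    """Evaluate compute(x) twice in time and accept only matching results.

    Returns (result, retries_used). Raises RuntimeError if the two
    evaluations still disagree after max_retries retries, which suggests
    the fault is not temporary.
    """
    for attempt in range(max_retries + 1):
        first = compute(x)
        second = compute(x)   # same inputs, later instant
        if first == second:   # concurrent check by comparison
            return first, attempt
    raise RuntimeError("results keep disagreeing; fault is not transient")


# A transient hits the second evaluation only (call number 1).
adder = FlakyAdder(faulty_calls={1})
value, retries = run_with_time_redundancy(adder, (3, 5))
print(value, retries)
```

The sketch uses one functional unit twice rather than three copies with a voter, which is why time redundancy can be cheaper than TMR for temporary faults. It would not detect a fault that corrupts both evaluations in the same way.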

Similar resources

An Instruction-Level Software Redundancy Method for Detecting Intra-Block and Inter-Block Control-Flow Errors

Electronic devices in space applications may be Radiation Tolerant or Commercial Off-The-Shelf (COTS). Due to cost and unavailability, the latter is usually used in many applications. In applications such as space, COTS equipment lacks reliability against threats like heavy-ion radiation; therefore, some alternatives should be considered to make the equipment resistant against the pro...

Vulnerabilities and Threats in Distributed Systems

We discuss research issues and models for vulnerabilities and threats in distributed computing systems. We present four diverse approaches to reducing system vulnerabilities and threats. They are: using fault tolerance and reliability principles for security, enhancing role-based access control with trust ratings, protecting privacy during data dissemination and collaboration, and applying frau...

Fundamental Concepts of Computer System Dependability

Dependability is the system property that integrates such attributes as reliability, availability, safety, security, survivability, maintainability. The aim of the presentation is to summarize the fundamental concepts of dependability. After a historical perspective, definitions of dependability are given. A structured view of dependability follows, according to a) the threats, i.e., faults, er...

A Microprocessor-Based Hybrid Duplex Fault-Tolerant System

Reliability is one of the fundamental considerations in the design of industrial control equipment. The microprocessor-based Hybrid Duplex fault-tolerant System (HDS) proposed in this paper has high reliability to meet this demand although its hardware structure is simple. The hardware configuration of HDS and the fault tolerance of this system are described. The switching control strategies in...

Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)

Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...

Publication date: 2003